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Abstract. Spectral clustering is a broad class of clustering procedures 
in which an intractable combinatorial optimization formulation of clus- 
tering is "relaxed" into a tractable eigenvector problem, and in which 
the relaxed solution is subsequently "rounded" into an approximate dis- 
crete solution to the original problem. In this paper we present a novel 
margin-based perspective on multiway spectral clustering. We show 
that the margin-based perspective illuminates both the relaxation and 
rounding aspects of spectral clustering, providing a unified analysis of 
existing algorithms and guiding the design of new algorithms. We also 
present connections between spectral clustering and several other top- 
ics in statistics, specifically minimum-variance clustering, Procrustes 
analysis and Gaussian intrinsic autoregression. 

Key words and phrases: Spectral clustering, spectral relaxation, graph 
partitioning, reproducing kernel Hilbert space, large-margin classifica- 
tion, Gaussian intrinsic autoregression. 



1. INTRODUCTION 

Spectral clustering is a promising approach to clus- 
tering that has recently been undergoing rapid de- 
velopment (Shi and Malik (2000); Kannan, Vempala 
and Vetta (2000); Zha et al. (2002); Ng, Jordan and 
Weiss (2002); Shortreed and Meila (2005); Ding, He 
and Simon (2005); Bach and Jordan (2006); von 
Luxburg (2007)). In the spectral framework a clus- 
tering problem is posed as a discrete optimization 
problem (an integer program). This problem is gen- 
erally intractable computationally, and approximate 
solutions are obtained by a two-step procedure in 
which (1) the problem is "relaxed" into a simplified 
continuous optimization problem that can be solved 
efficiently, and (2) the resulting continuous solution 
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is "rounded" into an approximate solution to the 
original discrete problem. The adjective "spectral" 
refers to the fact that the relaxed problem gener- 
ally takes the form of an eigenvector problem (the 
original objective function involves quadratic con- 
straints, which yields a Rayleigh coefficient in the 
relaxed problem). 

The solutions of the relaxed problem are often re- 
ferred to as spectral embeddings and have applica- 
tions outside of the clustering context (Belkin and 
Niyogi (2002)). Our focus here, however, will be on 
spectral clustering. 

Spectral clustering was first developed in the con- 
text of graph partitioning problems (Donath and 
Hofmann (1973); Fiedler (1973)), where the problem 
is to partition a weighted graph into disjoint pieces, 
minimizing the sum of the weights of the edges link- 
ing the disjoint pieces. The methodology is applied 
to data analysis problems by identifying nodes of 
the graph with data points and identifying the edge 
weights with the similarity (or "distance") function 
used in clustering. The problem then is to choose an 
appropriate relaxation of the weighted graph parti- 
tioning problem and an appropriate rounding proce- 
dure. The current literature offers many such choices 
(see, e.g., von Luxburg (2007)). 
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Naive formulations of graph cut problems yield that this idea can be extended to general multiway 
uninteresting solutions in which single nodes are sep- Pcut spectral relaxation, where the intuitive idea of 
arated from the rest of the graph. The spectral formulataongap" can be expressed precisely using ideas from 
becomes interesting (and computationally intractable) the classification literature, specifically the idea of a 
when some sort of constraint is imposed so that the multiclass margin. 



partition is balanced. There have been two main ap- 
proaches to imposing balancing constraints. In the 
ratio cut (Rcut) formulation (Chan, Schlag and 
Zien (1994)), the constraints are expressed in terms 
of cardinalities of subsets of nodes. In the normalized 
cut (Ncut) formulation (Shi and Malik (2000)), the 
constraints are expressed in terms of the degrees of 
nodes. In this paper we study a general penalized cut 
(Pcut) formulation that includes Rcut and Ncut 
as special cases and we emphasize the close rela- 
tionships between the spectral relaxations resulting 
from Rcut and Ncut formulations. 

A seemingly very different approach to clustering 
is the classical minimum- variance formulation where 
one minimizes the trace of the pooled within-class 
covariance matrix (Webb (2002)). As we show, how- 
ever, this formulation is closely related to Pcut. 
In particular, posing the minimum-variance prob- 
lem in the reproducing kernel Hilbert space (RKHS) 
defined by a kernel function (Wahba (1990)), we 
establish a connection between spectral relaxation 
and minimum-variance clustering by treating the 
Laplacian matrix in the PCUT formulation as the 
Moore-Penrose inverse of the kernel matrix in the 
minimum-variance formulation. 

Other forms of clustering procedures have been 
usefully analyzed in terms of their relationships to 
discrimination or classification procedures 
(Webb (2002)), and in the current paper we aim 
to develop connections of this kind in the case of 
spectral clustering. In this regard, it is important 
to note that our focus is on the multiway cluster- 
ing problem, in which a data set is directly parti- 
tioned into c sets where c> 2. This differs from the 
classical graph-partitioning literature, where the fo- 
cus has been on algorithms that partition a graph 
into two pieces ( "binary cuts" ) , with the problem of 
partitioning a graph into multiple pieces ( "multiway 
cuts") often approached by the recursive invocation 
of a binary cut algorithm. 

In the case of binary cuts, an interesting connec- 
tion to classification has been established by Rahimi 
and Recht (2004), who have noted that NcuT-based 
spectral clustering can be interpreted as finding a 
hyperplane in an RKHS that falls in a "gap" in the 
empirical distribution. In the current paper we show 



Turning to the rounding problem, we note first 
that for binary cuts the rounding problem is a rela- 
tively simple problem, generally involving the choice 
of a threshold for the elements of an eigenvector 
(Juhasz and Malyusz (1977); Weiss (1999)). The 
problem is significantly more complex in the mul- 
tiway case, however, where it essentially involves an 
auxiliary clustering problem based on the spectral 
embedding. For example, Yu and Shi (2003) pro- 
posed a rounding scheme that works with an alter- 
native iteration between singular value decomposi- 
tion (SVD) and nonmaximum suppression, whereas 
Bach and Jordan (2006) devised if-means and 
weighted X-means algorithms for rounding. In the 
current paper we show that rounding can be use- 
fully approached within the framework of Procrustes 
analysis (Gower and Dijksterhuis (2004)). Moreover, 
we show that this approach again reveals links be- 
tween spectral methods and multiway classification; 
in particular, we show that the auxiliary Procrustes 
problem that we must solve can be analyzed using 
the tools of margin-based classification. 

Extant multiway spectral algorithms, including 
those of Bach and Jordan (2006) and Yu and Shi 
(2003), as well as many others (Ng, Jordan and 
Weiss (2002); Zha et al. (2002); Ding, He and Si- 
mon (2005); Shortreed and Meila (2005)), are based 
on the representation of spectral embeddings as c- 
dimensional vectors. The redundancy inherent in us- 
ing c-dimensional vectors is inconvenient, however, 
preventing the flow of results from the binary case 
to the multiway case (Shi and Malik (2000)). The 
margin-based perspective that we pursue here shows 
the value of working with a nonredundant, (c — 1)- 
dimensional representation of the spectral embed- 
ding. 

Our overall approach to spectral clustering is as 
follows. We first construct a nonredundant, margin- 
based representation of multiway spectral relaxation 
problems. Such a margin-based spectral relaxation is 
a tractable constrained eigenvalue problem. We then 
carry out a rounding scheme by solving an auxiliary 
Procrustes problem, which is again associated with 
a margin-based classification method. We refer to 
the resulting clustering framework — margin-based 
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spectral relaxation with margin-based rounding — as 
margin-based spectral clustering. 

The margin-based approach not only provides sub- 
stantial insight into the relationships among spec- 
tral clustering procedures, but it also yields proba- 
bilistic interpretations of these procedures. Specifi- 
cally, we show that the spectral relaxation obtained 
from the Pcut framework can be interpreted as 
a form of Gaussian intrinsic autoregression (Besag 
and Kooperberg (1995)). These are limiting forms of 
Gaussian conditional autoregressions (Besag (1974); 
Mardia (1988)) that retain the Markov property (two 
vertices in a graph are not connected if and only if 
their corresponding embeddings in the intrinsic au- 
toregression are conditionally independent). 

In summary, the current paper develops a mathe- 
matical perspective on spectral clustering that uni- 
fies the various algorithms that have been studied 
and emphasizes connections to other areas of statis- 
tics. Specifically we discuss connections to multi- 
way classification, reproducing kernel Hilbert space 
methods, Procrustes analysis and Gaussian intrinsic 
autoregression. 

The remainder of the paper is organized as fol- 
lows. Sections 2 and 4 describe multiway spectral 
relaxation problems based on the general Pcut for- 
mulation and the minimum variance formulation, re- 
spectively. The relationship between these two for- 
mulations is also discussed in Section 4. In Section 3 
we present two rounding schemes, one based on Pro- 
crustean transformation and the other based on K- 
means. We present a geometric perspective on spec- 
tral clustering using margin-based principles in Sec- 
tion 5, and we discuss the connection to Gaussian 
intrinsic autoregression models in Section 6. Exper- 
imental comparisons are given in Section 7 and we 
present our conclusions in Section 8. Note that sev- 
eral proofs are deferred to the Appendix. 

We use the following notation in this paper. I m 
denotes the mxm identity matrix, l m the mxlof 
ones, the zero vector or matrix zero of appropriate 
size and H m = I m — —l m l' m the mxm centering 
matrix. For an n x 1 vector a = (ai, . . . , a n )', diag(a) 
represents the n x n diagonal matrix with a± , . . . , a n 
as its diagonal entries and ||a|| is the Euclidean norm 
of a. For an mxm matrix A = [a^-], we let dg(A) be 
the diagonal matrix with an, . . . ,a mm as its diago- 
nal entries, A + be the Moore-Penrose inverse of A, 
tr(A) be the trace of A, rk(A) be the rank of A and 
||A||i? be the Frobenius norm of A. 



2. SPECTRAL RELAXATION FOR 
PENALIZED CUTS 

Given a set of n d-dimensional data points, {xi, . . . , 
x n }, our goal is to cluster the x$ into c disjoint 
classes such that each Xj belongs to one and only 
one class. We consider a graphical representation of 
this problem. Let V = {1, 2, . . . , n} denote the index 
set of the data points and consider an undirected 
graph Q = (V, £) where V is the set of nodes in the 
graph and £ is the set of edges. Associated with the 
graph is a symmetric nxn affinity matrix (also re- 
ferred to as a similarity matrix), W = [wij], defined 
on pairs of indices such that w^ > for € £ 

and Wij = otherwise. The values Wij are often ob- 
tained via a function evaluated on the correspond- 
ing pairs of data vectors; that is, Wy = ^(x^Xj) for 
some (symmetric) function ip. A variety of different 
ways to map a data set into a graph Q and an affin- 
ity matrix W have been explored in the literature; 
for a review see von Luxburg (2007). 

The problem is thus to partition V into c subsets 
Vj\ that is, Vi nVj = for i / j and (JLi Vj = V, 
where the cardinality of Vj is rij so that YTj=x n j = 
n. This problem is typically formulated 
combinatorial optimization problem. Let W(-A, B) = 
J2i<=A j£B w ij f° r t wo (possibly overlapping) subsets 
A and B of V and consider the following multiway 
penalized cut criterion: 

(2.1) Pcut = 2^ — v J ' ; v J ' 11 , 

j=l L*eVi n i 

where 7r = (tti, . . . ,7r n )' is a user-defined vector of 
weights (examples are provided below) with tti > 
for all i. The numerator of each of the terms in this 
expression is equal to the sum of the affinities on 
edges leaving the subset Vj. Thus the minimization 
of Pcut with respect to the partition {V±, . . . ,V C } 
aims at finding a partition in which edges with large 
affinities tend to stay within the individual subsets 
Vj. The denominator weights ^2 ieV . encode a no- 
tion of "size" of the subsets Vj and act to balance 
the partition. 

The Pcut criterion can also be written in ma- 
trix notation as follows. Define D = diag(Wl n ) and 
let L = D — W denote the Laplacian matrix of the 
graph. (An nxn matrix L = [Zjj] is a Laplacian ma- 
trix if la > for i = 1, . . . , n; hj = lj% < for % ^ j; 
X^j=i^«j = for i = 1, . . . ,n. Note that Laplacian 
matrices are positive semidefinite (Mohar (1991)).) 
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Let II = diag(7Ti, . . . ,7r n ) be a diagonal matrix of 
weights. Let ij G {l,...,c} denote the assignment 
of Xj to a cell in the partition and define the indica- 
tor matrix E = [ei, . . . , e n ]', where S {0, l} cxl is a 
binary vector whose ijth entry is one and all other 
entries are zero. It can now be readily verified that 
Pcut takes the following form: 

(2.2) PcuT = tr(E / LE(E'nE)~ 1 ), 

where it is helpful to note that (E'lIE) -1 is a diag- 
onal matrix, implying that Pcut is simply a scaled 
quadratic form. We wish to optimize this scaled qua- 
dratic form with respect to E. 

Two well-known examples of the Pcut problem 
are the ratio cut (Rcut) problem (Chan, Schlag 
and Zien (1994)), in which II = I n , and the normal- 
ized cut (Ncut) problem (Shi and Malik (2000)), in 
which II = D. In the Rcut problem the notion of 
"size" of a subset Vj is simply the number of nodes 
in the subset, whereas in the Ncut problem "size" 
is captured by the total degree of the nodes in the 
subset. 

The spectral clustering approach to minimizing 
Pcut involves two stages: (1) we relax the problem 
into a tractable spectral analysis problem in which 
continuous variables replace the indicators E, and 
(2) we then employ a rounding scheme to obtain 
a partition {Vi, . . . , V n } from the continuous relax- 
ation. In the remainder of this section, we focus on 
the first step (the relaxation) and we return to the 
rounding problem in Section 3. 

The standard presentation of spectral relaxation 
proceeds somewhat differently in the case of a binary 
partition and a multiway partition (von Luxburg 
(2007)). In both cases, spectral relaxation is mo- 
tivated by the observation that the Pcut criterion 
in (2.2) has the form of a Rayleigh coefficient, and 
that replacing the indicator matrix E with a real- 
valued matrix yields a classical generalized eigenvec- 
tor problem. In the binary case, the indicator ma- 
trix E has two columns, which yields two general- 
ized eigenvectors in the relaxed problem. However, 
in the subsequent rounding procedure, the problem 
is to discriminate between two classes, for which a 
single vector direction suffices. To deal with this re- 
dundancy it is standard to place a (linear) constraint 
upon the relaxation, such that it is the second gen- 
eralized eigenvector that is used for rounding (von 
Luxburg (2007)). In the multiway case, on the other 
hand, no such constraint is imposed; the redundancy 



inherent in having c generalized eigenvectors to dis- 
criminate among c classes is generally not addressed. 
(It is resolved implicitly at the rounding stage.) 

We find this distinction between the binary case 
and the multiway case to be inconvenient, and thus 
in the approach to be described in the following sec- 
tion we adopt an idea from the literature on multi- 
way classification (e.g., Zou, Zhu and Hastie (2006); 
Shen and Wang (2007)) where nonredundant, (c — 
l)-dimensional vectors are used to discriminate among 
c classes. These vectors are referred to as margin vec- 
tors. We refer the reader to the classification litera- 
ture for the geometric rationale behind the terminol- 
ogy of "margin" (although we note that a geometric 
interpretation of margin vectors will also appear in 
the current paper in Section 5.1). 

2.1 Spectral Relaxation 

To formulate a spectral relaxation of (2.2), we re- 
place the indicator matrix E with a real n x (c — 
1) matrix Y = [yi, . . . ,y n ]' ■ The following proposi- 
tion, which is based on a result of Bach and Jordan 
(2006), shows that we can express the Pcut crite- 
rion in terms of real-valued matrices Y satisfying 
certain conditions. 

Proposition 1. Let Y be an n x (c — 1) real 
matrix such that: (a) the columns of Y are piece- 
wise constant with respect to the partition E, (b) 
Y'LTY = I c _i and (c) Y ni n = 0. Then Pcut is 
equal to tr(Y'LY) . 

The proof of Proposition 1 is given in Appendix A.l. 

For this proposition to be useful it is necessary to 
show that matrices satisfying the three conditions in 
Proposition 1 exist. Condition (a) for Y is equivalent 
to the statement that Y can be expressed as Y = 
ES& where \l/ is some c x (c — 1) matrix. Thus, the 
question becomes whether there exists a \I/ such that 
Y satisfies conditions (b)-(c). In Appendix A. 2 we 
provide a general procedure for constructing such a 
This establishes the following proposition. 

Proposition 2. Matrices Y satisfying the three 
conditions in Proposition 1 exist. 

We now obtain a spectral relaxation by dropping 
condition (a). This yields the following optimization 
problem: 

min tr(Y'LY) 

Y g Rnx(c-l) 

(2.3) 

s.t. Y'nY = I c _i and Y'lll n = 0, 

which is a constrained generalized eigenvalue prob- 
lem. 
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2.2 Solving the Spectral Relaxation 

Letting Yo = II 1//2 Y, we can transform (2.3) into 
the following problem: 

min tr(Yoir 1/2 Lrr 1/2 Yo), 

YoeR™x( c - 1 ) 

(2.4) 

s.t. Y' Y = I c _! and Y^Tl 1 ' 2 ^ = 0. 

The solution to this constrained eigenvalue problem 
is given in the following theorem. 

Theorem 1 . Suppose that L is a real symmetric 
matrix such that Ll n = and suppose that the diag- 
onal entries of TI are all positive. Let n 1 = an 1//2 l n 
be the eigenvector associated with the eigenvalue 71 = 
q/rr^LIU 1 / 2 , where a 2 = 1 / (l' n Ul n ) . Let the 
remaining eigenvalues o/rU^LLU 1 / 2 be arranged 
so that 72 < ■ • ■ < j n , and let the corresponding or- 
thonormal eigenvectors be denoted by /x i; i = 2, . . . , n. 
Then the solution of problem (2.4 ) is Yo = UQ where 
U = [/*2> • • • >A*c] an d Q * s an arbitrary (c — 1) x 
(c— 1) orthonormal matrix, with min{tr(Y II _1 / 2 x 
Ln _1 / 2 Y )} = Ei=2 7i- Furthermore, if 7 C < -y c+ i, 
then Yo is a strict local minimum 0/ tr(Y II _1 / 2 x 

Ln-V2 Yo ). 

It follows from the theorem that the solution of 
problem (2.3) is Y = II^UQ. The proof of The- 
orem 1 is given in Appendix A. 3. It is important to 
note for our later work that this theorem does not 
require L to be Laplacian or even positive semidefi- 
nite. 

The condition 7 C < j c +i implies a nonzero eigen- 
gap (Chung (1997)). In practice, the eigengap is of- 
ten used as a criterion to determine the number of 
classes in clustering scenarios. An idealized situation 
is that the multiplicity of the eigenvalue zero is c. 

3. ROUNDING SCHEMES 

We now consider the problem of rounding — trans- 
forming the real-valued solution of a spectral relax- 
ation problem into a discrete set of values that can 
be interpreted as a clustering. In this section we 
present two different solutions to the rounding prob- 
lem, one based on Procrustes analysis and the other 
based on the if -means algorithm. 

3.1 Procrustean Transformation for Rounding 

In Theorem 1 we have shown that the solution 
of the spectral relaxation problem is a matrix Y = 



Algorithm 1. Spectral Clustering with Procrustean Rounding 
1: Lnput: An affinity matrix W and a diagonal ma- 
trix n 

2: Relaxation: Obtain Y = II _1//2 UQ from pro- 
blem (2.3) 

3: Initialize: Choose the initial partition E 
4: Rounding: Repeat the following procedure until 
convergence: 

(a) Recompute EG, implement the SVD of 
U'EG as U'EG = 0AV and let Q = 

ev 

(b) Recompute Y = [y^] = n _1//2 UQ, com- 
pute ti = arg maXj yij , and recompute E 
by allo- 
cating the ith data point to class ti if 
maxj yij > and to class c otherwise 

5: Output {ti,... ,t n }. 



nr^UQ, where Q is an arbitrary orthogonal ma- 
trix. We have also seen, in Proposition 1, that a 
matrix Y in which the columns of Y are piece- 
wise constant with respect to a partition E pro- 
vides a representation of the objective function value 
Pcut. If we had such a matrix Y in hand we could 
straightforwardly find the partition E: Letting ti = 
argmaxjjj/jj}, allocate Xj to the tjth class if > 
and to the cth class otherwise. On the other hand, 
if we had the partition we could attempt to find 
an orthogonal matrix Q such that Y = I1~ 1 / 2 UQ 
is as close as possible to the partition. This latter 
problem can be treated as a problem in Procrustes 
analysis (Gower and Dijksterhuis (2004)). 

Specifically, given an indicator matrix E we pose 
the following Procrustes problem: 

(3.1) argminL(Q) = tr(EG - UQ)(EG - UQ)', 
Q 

where G = [I c _i - ~l c -il' c _ 1; -£l c -i]'- This prob- 
lem has an analytical solution: Denote the singular 
value decomposition of U'EG as U'EG = 0AV. 
Then the minimizing value of Q in L is given by 
Q = 0V' (see, e.g., Mardia, Kent and Bibby (1979), 
page 416). 

We summarize this Procrustean approach to round- 
ing in algorithmic form in Algorithm 1 in the context 
of a generic spectral clustering algorithm. 

Yu and Shi (2003) have presented a rounding al- 
gorithm that is similar to the Procrustean approach 
we have presented but different in detail. The au- 
thors work with an n x c matrix Z and solve the re- 
laxation mintr(Z'LZ) subject to Z'DZ = I C . Given 
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the solution Z of this relaxation, the authors then 
compute Z = [zij] = dg(ZZ') -1 / 2 Z. Their rounding 
scheme is to allocate the ith data point to class tj 
if U — arg maXj Zjj . This method can be viewed as 
imposing a constraint; in particular, note that the 
norms of the rows of Z are equal to 1. To moti- 
vate this constraint, the authors assume that the 
solution Z can be expressed rescaling of Z: 

Z = ^(Z'DZ)" 1 / 2 . Inverting this expression yields 
Z = dg(ZZ') -1 / 2 Z. But it is not clear that a solution 
Z of the relaxation can be expressed in this form; 
the constraints on Z are not incorporated into the 
relaxation. The use of Z defined in this way must be 
viewed as a heuristic post-processing procedure. The 
Procrustean approach that we have presented in this 
section provides a resolution of this difficulty; that 
approach requires no post-processing of the matrix 
obtained from the spectral relaxation. 

We return to the Procrustean approach in Sec- 
tion 5, where we provide additional justification for 
Procrustean rounding based on a connection to mar- 
gin maximization. 

3.2 .RT-means for Rounding 

Another approach to removing the "nuisance" or- 
thogonal matrix Q is to consider rounding meth- 
ods that are invariant to rotation. The standard K- 
means algorithm provides an example, and numer- 
ous authors have proposed using X-means on the 
embedding obtained from spectral relaxation as a 
heuristic rounding procedure (von Luxburg (2007)). 
Bach and Jordan (2006) have made this approach 
more formal by showing that (weighted) X-means 
arises when the rounding problem is formalized in 
terms of a difference between projection matrices. 
In this section we review this formulation within our 
nonredundant representation of spectral relaxation. 

Let us rewrite Pcut as 

Pcut = tr(E / H 7r LH' 7r E(E / riE) _1 ), 

where we define = I n — — A— ir'l n where we use 
the fact that H^LH^ = L. Defining E^ = • 
E(E'riE)~ 1 / 2 , we observe that the number of de- 
grees of freedom of both Y and E^ is (n — l)(c — 1). 
Moreover, given that E'^IIE^ = I c - (E'nE) _1 / 2 E'7r7r' 
E(E / nE)~ 1 / 2 /(7r / l n ) and 7r / E(E / IIE) _1 E / 7r = 7r'l n , 
there exists a c x c permutation matrix P such that 



Algorithm 2. Spectral Clustering with K-means Roun 
1: Input: An affinity matrix W and a diagonal ma- 
trix n 

2: Relaxation: Obtain Y = II 1/2 UQ from 
problem 
(2.3) 

3: Initialize: Choose the initial partition E 
4: Rounding: Repeat the following procedure until 
convergence: 

(a) Compute = - £ ieV . n iyi 

(b) Find ti = argmin^ \\yi — and recom- 
pute E by allocating the ith data point to 
class ti 

5: Output {t±, . . . , t n }. 

this suggests viewing Y as an approximation to E^ 
in the metric given by II. We quantify this by defin- 
ing the following distortion between the projection 
matrices defined by Y and E^: 

JkiE^Y) = l\\YnY' - e^lte;ii! 

= c - 1 - tr(Y / nE(E / nE)~ 1 E / IIY). 

This objective function can be represented as the 
solution of a weighted ET-means problem, as shown 
by the following result which is due to Bach and 
Jordan (2006): 

Theorem 2. Let Y = [yi, . . . ,y n ]' be a solution 
of problem (2.3). For any partition {V\, . . . , V c }, the 



criterion F(m l5 . . . , m c ) = Y?j=i HieV 3 I 
achieves its minimum J/^E^Y) at raj 



m , 



pe'^iie^p' 



"Ic-l 0" 




"Y /_ 











nY,ol 



Thus by updating the mean vectors rrij in the 
weighted ET-means algorithm we match the crite- 
rion Jfc(E w , Y), and by updating the partition using 
weighted K-means we go downhill in the criterion. 

Note that in the special case of the Rcut formu- 
lation, we obtain the conventional unweighted K- 
means algorithm (given that 7Tj = 1 in that case). 

We summarize the i^-means approach to rounding 
in algorithmic form in Algorithm 2. 

4. SPECTRAL CLUSTERING AND 
MINIMUM-VARIANCE CRITERIA 

In this section and the following two sections we 
present some relationships between spectral cluster- 
ing and various topics in statistics. Our goal is both 
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to illuminate the spectral approach and to suggest 
directions for further research. 

Minimum- variance clustering is a classical approach 
to clustering (Webb (2002)). In this section, fol- 
lowing Zha et al. (2002) and Dhillon, Guan and 
Kulis (2007), we present spectral solutions to the 
minimum-variance clustering problem, and we es- 
tablish connections between minimum- variance clus- 
tering and the PCUT framework. 

Let {xi,...,x n } G X C M. d denote the observed 
data. The pooled within-class covariance matrix Sjy 
is given by 

1 c 

Sw = ~Y Y^ ~ m i)( x * ~ m i)'> 

where rrij = ^-J2i&v x «- Consider the trace of the 
within-class covariance matrix: 

1 c 

tr(S Ty ) = -^^||x i -m j || 2 . 

j=i ieVj 

Clustering algorithms which are based on the min- 
imization of this trace are referred to as minimum- 
variance methods. 

In order to establish a connection with the spec- 
tral relaxation presented in Section 2, we define a 
weighted pooled within-class covariance matrix in 
an reproducing kernel Hilbert space (RKHS) induced 
by a reproducing kernel K. In particular, assume 
that we are given the reproducing kernel K : X x 
X — > M such that K(xi,Xj) = 4>{xi)' 4>{xj) for Xj, Xj G 
X, where </>(x) is called a feature vector correspond- 
ing to a data point x G X. In the sequel, we use the 
tilde notation to denote feature vectors. Thus, the 
data matrix in the feature space is denoted as X = 
[xi,X2, . . . ,x n ]'. The centered kernel matrix takes 
the form K = H n XX'H n ; note that it is positive 
semidefinite and satisfies Kl n = 0. 

Generalizing slightly, we introduce weighted ver- 
sions of the sample covariance matrix S, the between- 
class covariance matrix Sb and the within-class co- 
variance matrix Sw'- 

1 n 

S = ^7Ti(xi - m)(5q - m)', 

2u=i i=1 



Ss = Y Y 71-4 (™J ~ ril )( lil i - ™)'> 

2^=1 *i j=1 ie y. 



Sw = Y=Tn Y Y ~ ™i)(*< ~ ™i)'> 

^i=l ni j=i i£Vj 



where the 7Tj are known positive weights, m = — 
S" = i7riXj and riij = y > 1 - Xligy, ^i*,- It; is clear 

that Svp = S — Sb- 

We now formulate a minimum-variance cluster- 
ing problem in the RKHS as the minimization of 
tr(Siy), which is given by 

1 c 

tr(sV) = — Y Y ^n** ~ ™j I' 2 - 

Like the minimization of PCUT, this minimization 
is computationally infeasible in general. It is there- 
fore natural to consider minimizing tr(Sw) by using 
the spectral relaxations presented in Section 2.2. We 
present a way to do this in the following section. 

4.1 Spectral Relaxation in the RKHS 

Let us rewrite S and as 

S = -^X'H.nH^X 

and 

s B = -4-x'H,nE(E'nE) _1 E'nH;x, 

In 

recalling that H, = I n — ^—7vl' n . This yields 

Sw = [Xh^ith'^x 

7T In 

- x'H ?r nE(E'nE) _1 E'nH' ir x]. 

The minimization of tr(Sw) is thus equivalent to 
the maximization of 

(4.1) T = tr(E'nH^ r KH 7r IIE(E'lIE)~ 1 ), 

because X'tl^nH^X is independent of E and we 
have H n H n = H n . Let A = [£?-], where 5ij is the 
squared distance between x, and Xj, that is, 

= K(xi,Xi) + K(xj,Xj) - 2K(xi,Xj). 

Given that — ^H^AHjj- = H^KH^, the minimiza- 
tion of tr(Sw) is thus equivalent to that of tr(E'II • 
H / 7r AH 7r nE(E / nE)" 1 ). 

Recall that in the proof of Proposition 1, L is 
required to satisfy only the conditions L = L' and 
Ll n = 0. Note that IIH^KH^IIl™ = 0. Thus, if Y 
is an n x (c — 1) matrix subject to the three condi- 
tions in Proposition 1, we have T = tr(Y / IIH^KH 7r • 
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IIY). This allows us to relax the maximization of T 
with respect to E as follows: 

max ti^YTIH^KH^nY) 

Ye]K nx(c~l) 

(4.2) = tr(Y'lIKnY) 

s.t. Y'lIY = I c _i and Y'm n = 0, 

where the second equality in the objective is due to 
the identity Y TIH; = Y'll. Letting Y = II^Y 
leads to 

max tr(Yon 1/2 H' 7r KH 7r n 1/2 Y ) 

Y eK nx ( c - 1 ' 

(4.3) 

s.t. Y^Y = I c _i and Y' Q Yl l ' 2 l n = 0. 

This optimization problem is solved in Appendix A. 4. 
In particular, let U be an n x (c — 1) matrix whose 
columns are the top c — 1 eigenvectors of n 1 / 2 H^ r K • 
H^II 1 / 2 . The solution of problem (4.3) is then Yo = 
UQ where Q is an arbitrary (c — 1) x (c — 1) or- 
thonormal matrix. Hence, the solution of problem 

(4.2) is Y = rr 1 / 2 UQ. 

4.2 Minimum Variance Formulations versus Pcut 
Formulations 

Since the Laplacian matrix L is symmetric and 
positive semidefinite, its Moore-Penrose (MP) in- 
verse is also positive semidefinite. Thus we can re- 
gard L as the MP inverse of a kernel matrix K and 
investigate the relationship between the spectral re- 
laxations obtained from the minimum variance and 
the Pcut formulations. In fact, we have the follow- 
ing theorem, whose proof is given in Appendix A. 5. 

Theorem 3. Assume that L + = K. If rk(L) = 
rk(K) =n — 1, thenY is the solution of problem (2.3) 
if and only if it is the solution of problem (4-2). 

Thus, an equivalent formulation of spectral clus- 
tering based on the Pcut criterion is obtained by 
considering the minimum variance criterion with K = 
L + . Note that II consists of the diagonal elements of 
K + in the Ncut setting, so it is not expedient com- 
putationally to obtain II from K — we would need 
to calculate K + . We thus suggest defining II = I n 
in the minimum-variance setting, corresponding to 
the ratio cut formulation. 

It is also possible to start from a minimum- variance 
formulation (with II = \ n ) and obtain a Rcut prob- 
lem. However, in the corresponding RCUT problem, 
the matrix K + is not guaranteed to be Laplacian, 



because the off-diagonal entries of K + are possi- 
bly positive for an arbitrary kernel matrix K. In 
this case, we can let L = K + + n/3H n where j3 = 
min{maxi^j{[K + ]ij}, 0}. Such an L is Laplacian. 
Moreover, we have tr(Y'(K + + n/3H n )Y) = 
tr(Y'K+Y) + n(c - 1)/3 due to Y'Y = I c _i and Y' • 
l n = 0. Since min(tr(Y'(K + + n/3H n )Y)) is equiva- 
lent to min(tr(Y'K + Y)), it is not necessary to com- 
pute the value of j3. 

It is worth noting that the condition rk(L) = 
rk(K) = ?7, — l is necessary. Without this condition, 
IT 1/2 Lrr 1/2 is a generalized inverse of II 1/2 H;L + 
H^II 1 / 2 , because 

n 1 / 2 H / 7r L + H 7r n 1 / 2 n- 1 /2 Ln -V2 n i/2 H ^ L + 
H,n 1 / 2 = n 1 / 2 H;L + H,n 1 / 2 , 

but it is not necessarily the MP inverse. In this 
case, it is no longer the case that II _1 / 2 LII~ 1 / 2 
and n 1 / 2 H' 7r L+H^n 1 / 2 are guaranteed to have the 
same eigenvectors associated with nonzero eigenval- 
ues. Thus, in this case, even if K = L + , the solu- 
tions of (4.2) and (2.3) are different. In summary we 
see that the spectral clustering formulations based 
on the minimum-variance criteria and Pcut, while 
closely related, are not fully equivalent. 

Dhillon, Guan and Kulis (2007) pursue a slightly 
different connection between minimum- variance cri- 
teria and spectral relaxation. They formulate the 
minimum-variance criterion via the maximization of 

(4.4) T' = tr(E / IIKIIE(E / nE)~ 1 ), 

which is readily shown to be equal to T + -zr'K • 
7r/(7r'l n ), where T is defined by (4.1). Thus the 
maximization of T 1 is equivalent to the maximiza- 
tion of T. Dhillon, Guan and Kulis (2007) then for- 
mulate the cut minimization problem as an equiva- 
lent maximization problem: 

max(E / n(n- 1 - n- 1 Ln^ 1 )nE(E / nE)~ 1 ), 

and treat II -1 — II _1 Ln 1 as K in T". However, 
II -1 — II _1 LII _1 is generally indefinite, a difficulty 
that the authors circumvent by letting K = pl n — 
L in Rcut and K = pD' 1 + D^WD -1 in Ncut, 
where p is a constant chosen to make K positive 
semidefinite. 

The idea of considering a kernel matrix that is the 
MP inverse of a Laplacian matrix will return in later 
sections, in particular in Section 5.1 where we will 
see that it allows us to provide a geometrical inter- 
pretation for spectral clustering, and in Section 6, 
where we present a probabilistic interpretation of 
spectral relaxation. 
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5. SPECTRAL CLUSTERING: A 
MARGIN-BASED PERSPECTIVE 

In this section we consider a margin-based per- 
spective on spectral clustering. First, we show that 
the margin-based perspective provides us with in- 
sight into the relationship between spectral embed- 
ding and rounding. In particular, we show that the 
problems in (2.3) and (4.2) can be understood in 
terms of the fitting of hyperplanes in an RKHS. For 
a data point x, we show that the elements of the em- 
bedding y are proportional to the signed distances 
of feature vector x to each of these hyperplanes. 
This provides support for the Procrustean round- 
ing in which rounding is achieved by nonmaximum 
suppression of the elements of y. Second, we pro- 
vide some additional direct justification for the Pro- 
crustean approach, showing that the rounding prob- 
lem can be analyzed in terms of the approximation 
of a margin-based multiway classification criterion. 

5.1 Hyperplanes in the RKHS 

Let us consider a multiway classification problem. 
That is, we consider a problem in which data points 
are pairs, (xj,ij), where ti is the label of the ith 
data point. Using the same notation as in Section 4, 
the multiway classification problem has the follow- 
ing standard formulation in an RKHS based on a 
kernel function K: 

n 

(5.1) min tr(B'KB) + ~Y f f u (B'k< + f3 ), 

i=i 

where /?(•) is a convex surrogate of the 0-1 loss, kj = 
(K(xi,Xi), . . . , K(x n ,Xi))' is the ith column of the 
kernel matrix K, B = [bi, . . . , b c _i] is an n x (c — 1) 
matrix of regression vectors, (3 Q is a (c — 1) x 1 vector 
of intercepts and 7 > is a regularization parameter. 
We can use this optimization problem as the basis 
of a clustering formulation by simply omitting the 
term ^^i=i /**(')> reflecting the fact that we have 
no labeled data in the clustering setting. We obtain 

mintr(B'KB) 
B 

(5.2) 

s.t. B'Kni n = and B'KLTKB = I c 1. 

We now consider problem (5.2) from two points of 
view. From the first point of view, we let Y = KB 
and transform (5.2) into 

mintrfY'K+Y) 
Y 

(5.3) 

s.t. Y'ni n = and Y'lTY = I c _i, 



where we have used the identity K = KK + K. It 
is readily seen that (5.3), and hence (5.2), is iden- 
tical with the spectral relaxation in (2.3) by taking 
K + = L. We also obtain a relationship between (5.3) 
and (4.2) from Section 4.2; in particular, in the spe- 
cial case in which rk(K) = n — 1, it follows from 
Theorem 3 that (5.3) and (4.2) are equivalent. 

From a second point of view, we let S = X'B (re- 
call that X is the data matrix in the feature space). 
The problem (5.2) is then transformed into 

mintr(S'S) 
s 

(5.4) „ „ „ 

s.t. S X TIl n = and S'X'IIXS = I c _i. 

Letting S = [si, . . . , s c _i] denote the solution of (5.4), 
the equations s'x = 0, j = 1, . . . , c — 1, define hy- 
perplanes that pass through the weighted centroid 
J^ =1 7TjXj of the feature vectors 5q. Moreover, the 
signed distance between feature vector Xj and the 
hyperplane s^x = is s^Xj. Recall that Y = [yij\ = 

KB = XX'B = XS. We thus have yij = s^5q. That 
is, yij is the signed distance of 5q to the jth hyper- 
plane. We can therefore interpret the spectral re- 
laxation in (2.3) and (4.2) as yielding vectors whose 
elements are — using the language of multiway classi- 
fication — margin vectors. Given this interpretation, 
it is reasonable to allocate labels by finding the max- 
imum element of (yn, . . . , g/j |C _i,0). This motivates 
the Procrustean approach to rounding, which can 
be viewed as identifying boundaries between clus- 
ters by projecting feature vectors onto hyperplanes 
in an RKHS. A graphical interpretation of this re- 
sult is provided in Figure 1. 

5.2 Margin-Based Rounding Scheme 

We can also provide a direct connection between 
classification and rounding. Let us return to the ob- 
jective function in (5.1), which we rewrite as 

n 

imntr(Y'K + Y) + ^/ ti ( yi ) 
i=l 

by letting Y = KB and setting /3 = 0. Assume that 
we have obtained a matrix Y from spectral relax- 
ation and recall that Y depends on an arbitrary or- 
thogonal matrix Q. From the classification perspec- 
tive we can view the subsequent rounding problem 
as the problem of minimizing the classification loss 
n Yh=i fti(yi) under the constraint QQ' = I c _i. In 
this section we explore some of the consequences of 
this perspective. 
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(a) (b) 

Fig. 1. Illustrations of spectral clustering in the feature space for a three-class separable example. The clustering is based on 
the signed distances of the feature vector x = 0(x) to suitably defined hyperplanes. (a) Hyperplanes in the feature space are 
represented by their normals, &j, j = 1,2,3, subject to the sum-to-zero constraints. These hyperplanes are computed from the 
vectors si and S2 obtained from spectral relaxation via ai = si — |(si + S2), a.2 = S2 — |(si + S2) and a.3 = — |(si + S2). (b) 
The hyperplanes defined by the vectors si and S2 ■ Note that si = ai — a;j and S2 = a2 — a.3 . 



In the multiway classification problem, we define 
class- conditional probabilities Pj(x) for the c classes 
j = 1, . . . , c. Using this notation, we define the ex- 
pected error at x as follows: 

c 

(5.5) /2(x,y)= J^I^P^x), 

i=i 

where t = argmax^yj or t = c if maxjyj} < and 
where IWi defines the 0-1 loss: it is 1 if # is true 
and otherwise. Since ![.] is a non-convex objective 
function that leads to an intractable optimization 
problem, the standard practice in the classification 
literature is to replace In with a "surrogate loss 
function" fj(y) that is an upper bound on the 0- 
1 loss (Bartlett, Jordan and McAuliffe (2006); Shen 
and Wang (2007)). 

The surrogate loss function that we consider in 
the current paper is the following exponential loss: 

(5-6) fj(y) = ^exp(y z - yj), 

where for convenience we extend y to a c-dimensional 
vector in which y c = 0. Note that the variables to be 
optimized are the entries of the matrix Q. Clearly, 
fj(y) is an upper bound of \t^j\i because if x does 
not belong to class j, there exists at least one yi such 
that I 7^ j and yi~yj>0, and hence exp(y^ — yj) > 1. 



This surrogate loss function also has an important 
Fisher consistency property: 

Proposition 3. Assume Pj(x) > for j = 1, 
. . . , c. We then have 

c 

Vj = argmax ^ ^ exp(y/ - y^Pj (x) 
y j=i Ift 



The proof of Proposition 3 is a straightforward cal- 
culation, so we omit it. This proposition shows that 
the surrogate loss function that we have chosen is 
justified from the point of view of classification as 
yielding a Bayes consistent rule (Bartlett, Jordan 
and McAuliffe (2006); Zou, Zhu and Hastie (2006)). 

Returning to the rounding problem, we now con- 
sider the labels {ti} as temporarily fixed and con- 
sider the empirical risk function defined over the set 
of pairs (xj,tj) given by 

1 n 

i=l i^t t 

We wish to optimize this empirical risk with respect 
to Q. This problem does not have a closed- form so- 
lution under the constraint QQ' = I c -i- However, 
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we can consider a Taylor expansion around j/jj = 0. 
We have 

n n 

J(Q)^(c-l)-^5>;y J + c 2 5>r\ 

i=i i=i 

where gj is the jth column of G' = [I c _i — ^l c _il' c _ 1 , 
— -l c _i] , and where we have used the fact that y ■gt ! • 

s'uYi = ^V/Qg^QVi < V 71 "* because I c _i - 
gtiSt i s positive semidefinite. We thus see that the 
maximization of the linear term Y27=l St Yi with re- 
spect to Q yields an approximate procedure for min- 
imizing J(Q). But this is precisely the Procrustean 
problem (3.1) discussed in Section 3. 

It would also be possible to attempt to optimize 
J{Q) directly by making use of Newton or conjugate 
gradient methods on the Stiefel manifold (Edelman, 
Arias and Smith (1999)). 

6. SPECTRAL RELAXATION: THE VIEW 
FROM GAUSSIAN INTRINSIC 
AUTOREGRESSION 

In this section we show that spectral relaxation 
can be interpreted as a model-based statistical pro- 
cedure. In particular, we present a connection be- 
tween spectral relaxation and Gaussian intrinsic au- 
toregression models. 

Our focus is the spectral relaxation problem pre- 
sented in Section 2, specifically the constrained eigen- 
value problem in (2.3). 

Recall that the Laplacian matrix L is a positive 
semidefinite matrix; moreover, the pseudoinverse L + 
is positive semidefinite and can be viewed CIS db ker- 
nel matrix. We found this perspective useful in our 
discussion of minimum-variance clustering in Sec- 
tion 4.2; note also that (Saerens et al. (2004)) have 
explored connections between spectral embedding 
and random walks on graphs using the fact that the 
elements of L + are closely related to the commute- 
time distances obtained from a random walk on the 
graph. In this section, we take the interpretation of 
L + in a different direction, using it to make the con- 
nection to Gaussian intrinsic autoregressions. 

Denote K = L + where L = D — W. Let us model 
the n X (c— 1) matrix Y as a singular matrix- variate 
normal distribution A^ njC _i(0,o" 2 K® I c -i) where we 
follow the notation for matrix- variate normal distri- 
butions in (Gupta and Nagar (2000)). That is, 

p(Y)ocexpf--^tr(Y'LY)Y 



Let us set a 2 = l/tr(IIK) so that E(Y'IIY) = a 2 
tr(IIK)I c _i = I c _i. Finally, we impose the constraint 
Y'IIl n = in order to remove the redundancy 
K + l n = in K + . We thus obtain the following propo- 
sition. 

Proposition 4. The relaxation problem in (2.3) 
is equivalent to the maximization of the log likeli- 
hood p(Y) under the constraints Y'lIY = I c _i and 

Y'ni n = 0. 

We obtain a statistical interpretation of spectral 
relaxation from the fact that a multivariate nor- 
mal distribution can be equivalently expressed as 
a Gaussian conditional autoregression (CAR) (Be- 
sag 1974; Mardia 1988). Indeed, given Y ~ iV n>c _i x 
(0, <t 2 K I c -i), we have that the yj can be charac- 
terized as (c — l)-dimensional CARs with 

[. n 

E(y»|yj, j ^ i) = -J2 r y i = -f- y ii 

(6.1) 

Var(yj| yi ,j ^i) = y-I c -i. 

Hi 

That is, we have Yi\{yj : j / i} ~ N c -l(X)"=i T^y?; 

2 

?— I c _i), for i = 1, ... ,n. Since K is positive semidefi- 
nite but not positive definite, Besag and Kooperberg 
(1995) referred to such conditional autoregressions 
as Gaussian intrinsic autoregressions. 

The CAR model implicitly requires uia = and 
hi — Y^ 1 j=\ w ij- l n spectral embedding and cluster- 
ing (Guattery and Miller (2000); Belkin and Niyogi 
(2002); Ng, Jordan and Weiss (2002)), the w tj are 
usually used to assert adjacency or similarity rela- 
tionships between the Vj. We will see shortly that 
these adjacency or similarity relationships have an 
interpretation as conditional independencies. 

Since D — W is positive semidefinite, D — cjW is 
positive definite for u £ (0,1). This fact has been 
used to devise CAR models based on D — wW such 
that E(yi|yj, j^i) =w^"=i j^rYj (see, e.g., Carlin 
and Banerjee (2003)). We now have 

As a result, = (or Wij = 0) implies that y, _L 
-L Yj\{yi '■ I that is, y$ is conditionally in- 

dependent of yj given the remaining vectors. This 
Markov property also holds for Gaussian intrinsic 
autoregressions (Besag and Kooperberg (1995)). 
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This perspective sheds light on some of the re- 
lationships between the Nout and Rcut formula- 
tions of spectral relaxation. Recall that since II = 
D in the Ncut setting, we impose the constraints 
Y'DY = I c _i and Y'Dl n = 0. On the other hand, 
the Rcut formulation uses the constraints Y'Y = 
I c _i and Y'l n = because II = I n . Theorem 1 
shows that the solution of the Ncut is based on 
n _1 / 2 Ln _1 / 2 = I n - D" 1 /2wD- 1 /2 ) w hi c h is a so- 
called normalized graph Laplacian. The solution of 
the Rcut problem is based on the unnormalized 
graph Laplacian L. Now Proposition 1 reveals a 
problematic aspect of the Ncut formulation — piece- 
wise constancy of the columns of Y is accompa- 
nied by a lack of orthogonality of these columns. 
Two natural desiderata of spectral clustering are in 
conflict in the Ncut formulation. This conflict be- 
tween orthogonality and piecewise constancy is not 
present for Rcut. However, the existing empirical 
results showed that the normalized graph Laplacian 
tends to outperform the unnormalized graph Lapla- 
cian. Moreover, von Luxburg, Belkin and Bousquet 
(2008) provided theoretical evidence of the superi- 
ority of the normalized graph Laplacian. 

This seeming paradox can be resolved by using an 
alternative choice for L in the Rcut formulation. 
Let us set L = (I n — C)'(I n — C), where C = [cij] is 
an n x n nonnegative matrix such that cu = for 
all i and Cl n = l n . Such a L is positive semidefi- 
nite but no longer Laplacian. Since Ll n = 0, we can 
still solve the spectral relaxation problem (2.4) using 
Theorem 1. 

Our experimental results in Section 7 show that 
this novel Rcut formulation is very effective. It is 
also worth noting that we can connect this formula- 
tion to the simultaneous autoregression (SAR) model 
of Besag (1974). In particular, the Vj are now spec- 
ified by n simultaneous equations: 
n 

Yi = ^CijYj + £ i> i = l,...,n, 

3=1 

where the are independent normal vectors from 
N c -i(Q, cr 2 I c _i). This equation can be written in 
matrix form as follows: 

Y=CY+S 

with S = [£i,...,e n ]'~iV re)C _i(0,(r 2 I n ®I c _i). 

We thus have Y ~ iV n)C _i(0, cr 2 K(g>I c „i) with K + = 
(I n — C)'(I n — C). In practice, we are especially con- 
cerned with the case in which C = D _1 W. It is 



worth noting that I n - D" 1 / 2 WD" 1 / 2 and l n - D 1 W 
have the same eigenvalues, while the squared sin- 
gular values of l n — D X W are the eigenvalues of 
(I„ - D^Wy^n - D-^W). We thus obtain an in- 
teresting new relationship between the Ncut for- 
mulation and the RCUT formulation. 

7. EXPERIMENTS 

Although our principal focus has been to provide a 
unifying perspective on spectral clustering, our anal- 
ysis has also provided novel spectral algorithms, and 
it is of interest to compare the performance of these 
algorithms to existing algorithms. In this section we 
report the results of experiments conducted with six 
publicly available data sets: five data sets from the 
UCI machine learning repository (the dermatology 
data, the vowel data, the NIST optical handwritten 
digit data, the letter data and the image segmenta- 
tion data) as well as a set of gene expression data 
analyzed by Yeung et al. (2001). 

In the dermatology data, there are 366 patients, 
8 of whom are excluded due to missing information, 
with 34 features. The data are clustered into six 
classes. We standardized the data to have zero mean 
and unit variance. The NIST data set contains the 
handwritten digits 0-9, where each instance consists 
of a 16 x 16 pixel and where digits are treated as 
classes. We selected 1000 digits, with 100 instances 
per digit, for our experiments. The vowel data set 
contains the eleven steady-state vowels of British 
English. The letter data set consists of images of 
the letters "A" to "Z." In our experiments we se- 
lected the first 10 letters with 195, 199, 182, 207, 203, 
210, 226, 196, 188 and 172 instances, respectively. 
The image segmentation data consist of seven types 
of images: "brickface," "sky," "foliage," "cement," 
"window," "path" and "grass." The gene data set 
contains 384 genes with 17 time points over two cell 
cycles. The data were standardized to have mean 
zero and unit variance (Yeung et al. (2001)). We 
treated the five phases of the cell cycle as five nomi- 
nal classes for these data, classifying genes into these 
classes according to their expression level peaks. Ta- 
ble 1 gives a summary of these data sets. 

We compared our rounding algorithm based on 
Procrustean transformation (see Algorithm 1) with 
those based on the rounding procedures given in 
Bach and Jordan (2006) and Yu and Shi (2003), 
conducting comparisons using the NCUT, RCUT and 
minimum- variance criteria. We refer to the weighted 
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Table 1 

Summary of the benchmark data sets 





Gene 


Dermatology 


Vowel 


NIST 


Letter 


Segmentation 


II 


384 


358 


990 


1000 


1978 


2100 


d 


17 


34 


10 


256 


16 


19 


c 


5 


6 


11 


10 


10 


7 



n — the number of samples; d — the number of features; c — the 
number of classes. 



iC-means and the iC-means algorithms of Bach and 
Jordan (2006) as BJ-wkm and BJ-km, respectively. 
Note that the spectral clustering algorithm based 
on the NCUT formulation and iC-means rounding 
is equivalent to that presented by Ng, Jordan and 
Weiss (2002). We initialized the ET-means algorithms 
by the orthogonal initialization method in Ng, Jor- 
dan and Weiss (2002). For the rounding scheme of Yu 
and Shi (2003), we used two initialization methods: 
the orthogonal initialization method and initializa- 
tion to the identity matrix. We refer to the corre- 
sponding algorithms as YS-1 and YS-2. We also 
used these two initialization methods in our algo- 
rithm (Algorithm 1), referring to the results in these 
two cases as Margin- 1 and Margin-2. 

7.1 Setup and Evaluation Criterion 

We defined the adjacency matrix W = [wij] as 
Wij = exp(— ||xj — Xj|| 2 //3) with f3 > 0. The kernel 
matrix is defined as K = H n WH n . For the margin- 
based algorithms, however, we set wu = for i = 
1, ...,n; in this case the kernel matrix is defined 
as K = H n (I n + W)H n . For simplicity, we do not 
distinguish between these two cases in our notation 
in the remainder of this section. In the minimum- 
variance formulation we always set II = I n . With 
these settings, the BJ-wkm and BJ-km algorithms 
are based on the spectral decomposition of I n — 
D- 1 /2 WD -i/2 > The YS-1 and YS-2 algorithms are 
based on the spectral decomposition of I n — D _1 W, 
and the Margin- 1 and Margin-2 algorithms are based 
on the spectral decomposition of I n — D _1 / 2 WD -1 / 2 . 

Although L = D — W is one natural choice in the 
Rcut setting, we instead adopted the suggestion in 
Section 6 and defined L as 

(7.1) L=(I n -D" 1 W) / (I n -D- 1 W). 

To simplify the comparison among procedures, we 
fixed f3 to specific sets of values for each of the data 
sets, exploring a range of values to investigate the 



relative sensitivities to the choice of f3 for the dif- 
ferent clustering algorithms. Our specific choices for 
both the Ncut and Rcut criteria were f3 G {1, 10} 
for the gene data, (3 G {1,10,100} for the "vowel" 
data, (3 G {5000, 10000, 20000} for the "image seg- 
mentation" data, and (3 G {10, 100, 1000} for the "der- 
matology," "NIST" and "letter" data sets. Since the 
minimum- variance criterion directly operates on K, 
we choose a different set of values when working with 
this criterion; in particular, we used (3 G {10,100} 
for the gene data, /3 G {100,1000} for the "derma- 
tology" data, (3 G {1,10,100} for the "vowel" data, 
13 G {500,1000} for NIST data, (3 G {10,100,1000} 
for the "letter" data, and (3 G {10, 100, 1000} for the 
"image segmentation" data. 

To evaluate the performance of the various clus- 
tering algorithms we employed the Rand index (RI) 
(Rand (1971)). Given a set of n objects S = {0±, . . . , O n }, 
suppose that U = {U± , . . . , U r } and V = {V±, . . . , V s } 
are two different partitions of the objects in S such 
that ULi Ui = S = \J s j=1 Vj and U i r\U v = = Vj n 
Vji for i ^ i! and j ^ j' . Let a be the number of pairs 
of objects that are in the same set in U and in the 
same set in V, and b the number of pairs of objects 
that are in different sets in U and in different sets 
in V . The Rand index is given by RI = (o + b)/ (^) • 
If RI = 1, the two partitions are identical. 

Since the ground-truth partitions are available for 
our six data sets, we directly calculated RI between 
the true partition and the partition obtained from 
each clustering algorithm. We conducted 50 repli- 
cates of each of the algorithms that require ran- 
dom initialization (this is not necessary for YS-2 
and Margin-2, which are initialized to the identity 
matrix). Note that for the RCUT and minimum- 
variance criteria, BJ-wkm and BJ-km become iden- 
tical because in these cases II = I n . 

7.2 Performance Analysis 

Figure 2 displays the results for all six algorithms 
using the Ncut criterion. We see that the margin- 
based algorithms are competitive with the other al- 
gorithms. The poorest performer in this setting is 
BJ-wkm, which is highly sensitive to the value of f3. 
In particular, when (3 = 10 for the "gene" data set, 
(3 G {10, 100} for the vowel data, f3 G {1000, 100, 10} 
for the "letter" data, and f3 = 1000 for both the "der- 
matology" and "NIST" data sets, this algorithm al- 
most failed. A possible interpretation for this result 
is the conflict between orthogonality and piecewise 
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Fig. 2. Clustering results (Rand index) with normalized cuts. "BJ-WKM" : the weighted K -means rounding of Bach and 
Jordan 2006; "BJ-KM": the K -means rounding of Bach and Jordan 2006; "YS-1": the rounding scheme of Yu and Shi 2003 
with the orthogonal initialization method; "YS-2": the rounding scheme of Yu and Shi 2003 with initialization to the identity 
matrix; "Margin-1" : the rounding scheme in Section 3.1 with the orthogonal initialization method; "Margin-2" : the rounding 
scheme in Section 3.1 with initialization to the identity matrix. 
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Fig. 3. Clustering results (Rand index) with ratio cuts. See the caption of Figure 2 for explanation of the acronyms. 
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Fig. 4. Clustering results (Rand index) with the minimum-variance criterion. See the caption of Figure 2 for explanation of 
the acronyms. 
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constancy implied in the Ncut setting (see Propo- 
sition 1). Indeed, as can be seen from Figure 2, the 
situation is more favorable for the BJ-km round- 
ing algorithm; in this case D _12 Y(Y / D _1 Y) _1//2 is 
used, which diminishes the conflict between orthog- 
onality and piecewise constancy. Similarly, the con- 
flict is diminished for the YS rounding algorithms 
and our margin-based rounding methods (because 

argmaxjdj l ^ 2 y%j is equivalent to arg maXj- yij). 

Recall that the YS-1 and YS-2 algorithms need 
to use a heuristic post-processing procedure; that is, 
the algorithms operate on Z = dg(ZZ')~ 1 / 2 Z. We 
found that the performance of the algorithms de- 
pends strongly on this procedure. 

Figures 3 and 4 display the experimental results 
using the RCUT and minimum-variance criteria, re- 
spectively. We see again that the margin-based al- 
gorithms are competitive with the other algorithms; 
indeed for several of the data sets the margin-based 
algorithms yield better performance than the other 
algorithms. 

We see from Figures 3 and 4 that BJ-km is com- 
petitive with the other algorithms. This shows that 
the choice of L given in (7.1) is an effective choice. 

We again found it to be the case that the heuristic 
post-processing procedure was needed for YS-1 and 
YS-2 to yield good clustering performance. 

The performances of Margin-1 and Margin-2 were 
similar across the data sets and criteria, showing the 
relative insensitivity of the margin-based approach 
to the initialization. Note in particular the larger 
degree of variability between the performances of 
YS-1 and YS-2. Note also that the margin-based 
approach was in general less sensitive to the value 
of f3 than the other algorithms. 

Finally, recall that L in (7.1) for the RCUT setting 
and L = K + obtained from the minimum-variance 
setting are positive semidefinite but they are not 
Laplacian matrices, because the off-diagonal elements 
of the W = L — D are possibly negative. Nonethe- 
less, our experimental results showed that these two 
choices are still effective. Thus cuts can be defined 
through non-Laplacian matrices. Although such cuts 
lose their original interpretation in terms of the graph 
partition, as we have shown they do have a clear sta- 
tistical interpretation in terms of Gaussian intrinsic 
autoregression models. 

8. DISCUSSION 

In this paper we have presented a margin-based 
perspective on multiway spectral clustering. We have 



shown that both aspects of spectral clustering — relaxa- 
tion and rounding — can be given an interpretation 
in terms of margins. The major advantage of this 
perspective is that it ties spectral clustering to the 
large literature on margin-based classification. The 
margin-based perspective has several additional con- 
sequences: (1) it permits a deeper understanding 
of the relationship between the normalized cut and 
ratio cut formulations of spectral clustering; (2) it 
strengthens the connections between the minimum- 
variance criterion and spectral clustering; and (3) it 
yields a statistical interpretation of spectral cluster- 
ing in terms of Gaussian intrinsic autoregressions. 
Also, the preliminary empirical evidence that we 
presented suggests that the algorithms motivated by 
the margin-based perspective are competitive with 
existing spectral clustering algorithms. 

One of the most useful consequences of the margin- 
based perspective is the interpretation that it yields 
of spectral clustering in terms of projection onto hy- 
perplanes in a reproducing kernel Hilbert space (see 
Figure 1). This interpretation shows that the per- 
formance of the margin-based clustering algorithms 
depends on the separability of the feature vectors. 
This suggests that the algorithmic problem of choos- 
ing the similarity matrix W or kernel matrix K so 
as to increase separability is an important topic for 
further research; see Bach and Jordan (2006) and 
Meila and Shi (2000) for initial work along these 
lines. 

Although we have focused on undirected graphs 
in our treatment, it is also worth noting the possi- 
bility of considering clustering in a directed graph 
with the asymmetric weighted matrix D -1 W (Meila 
and Pentney (2007)). This can be related to our dis- 
cussion in Section 6, where we suggested the use of 
the matrix L = (I n — D^W)'^ - D^W) in the 
RCUT setting. The experimental results in Section 7 
showed that such a suggestion is promising. More- 
over, although L is no longer Laplacian, the corre- 
sponding spectral relaxation can be interpreted as 
a simultaneous autoregression model. The relation- 
ship between simultaneous autoregression and con- 
ditional autoregression (Ripley (1981)) may provide 
connections between spectral clustering in undirected 
graphs and directed graphs. We intend to explore 
this issue in future work. 

In delineating a relationship between the Pcut 
criterion and the kernel minimum- variance criterion, 
we have proven that the relaxation problems (2.3) 
and (4.2) have the same solution whenever rk(L) = 
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?7,— l and L + = K. This leads to the question as to 
whether the original unrelaxed problems — that is, 
the minimization of Pcut and the maximization of 
T with respect to discrete partition matrix E — have 
the same solution under the conditions rk(L) =n — l 
and L + = K. This is currently an open problem. 

APPENDIX 

A.l Proof of Proposition 1 

Since the columns of Y are piecewise constant 
with respect to the partition E, we can express Y as 
Y = E* for some * £ M cx ( c - 1 ). Let Y = II 1 / 2 Y, 
*o = [*,«l c ], a cxc matrix, and Z = [Y , all 1 ' 2 l n ], 
where a = 1 /y/l' n Ul n . We have n _1/2 Z = E* and 
Z'Z = [Y , an l / 2 l n ]'[Y Q ,an l / 2 l n ] = I c due to El c = 
l n , Y' Y = Y'UY = I c _! and Y^II 1 / 2 ^ = Y'IIl n = 
0. Furthermore, we have \I>' E'nE\l>o = Z'Z = I c . 
Since \l/o an d E'lIE are square, \I/o an d E'lIE are 
invertible. Hence * *o = (E'lIE) -1 . We now have 

tr(Y'LY) = tr(Y[,n -1 / 2 Ln~ 1 / 2 Y ) 

= tr(Z'n _1 / 2 Ln- 1 / 2 Z) = tr(*' E'LE* ) 
= tr(E'LE*o*'o) = tr(E'LE(E'nE) -1 ), 

completing the proof. 

A. 2 The Proof of Proposition 2 

In this section we provide a constructive proof of 
Proposition 2 by establishing the existence of \I/ . We 
also provide an example of the construction in the 
special case of c — 4 and II — I n . 

Let (E'lIE) -1 = diag(l//3i, . . . , 1//3 C ) and /3 = (ft, 
. . . ,/3 c )'. We then have l' n Ul n = n'l n = j3'l c and 
E'IIl n = (3. In the proof in Appendix A.l, we ob- 
tain * *o = (E'lIE) -1 . Thus, 

**' = diag(l/ft, . . . , 1//3 C ) - -4-lcl' c 

(denoted A). 

In order to make the above equation hold, it is nec- 
essary for A to be positive semidefinite. Given any 
nonzero b = (b\ , . . . , b c )' £R C , we have 



to obtain A/3 = 0. Using the SVD of A, we are al- 
ways able to obtain a \l/ such that \E r \E r ' = A and 
\E r ' '(3 = 0. Consequently, we have 



l^IIE* = /3'* = and *'E'nE* = I 



c-l- 



The latter equality comes from 
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since the function f(x) = x 2 is convex. This implies 
that A positive semidefinite. Furthermore, it is easy 



It is easily verified that Y = E\I/ satisfies the condi- 
tions (a)-(c) listed in Proposition 1. Let ai,...,a c 
denote the row vectors of We note that an arbi- 
trary collection of c — 1 vectors from the set ai , . . . , a c 
are linearly independent. The convex hull of ai, . . . , a c 
is thus a (c — l)-dimensional simplex. (A d-dimensional 
simplex is the convex hull of an affinely independent 
point set in M. d . A regular c?-dimensional simplex is 
the convex hull of d+ 1 points with all pairs of points 
having equal distances.) In addition, we have that 
the squared distance between a^ and &j is 

K _ M . = I + I & „^-. 

Vi Vj 

Note that we have 77 = n and r/j = rij when II = I n . 
In particular, if II = I n and n± = • ■ • = n c = — , the 
aj constitute the vertices of a (c — l)-dimensional 
regular simplex. 
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A. 3 The Proof of Theorem 1 

This theorem is a variation on a standard result in 
linear algebra; for completeness we present a proof. 
Let S = n _1 / 2 LII _1 / 2 and consider the following 
Lagrangian: 

L(Y ,A,b) 

= tr(Y SY ) - tr(A(Y Y - 1^)) - b'Y n 1/2 l„., 

where A is a (c — 1) x (c — 1) symmetric matrix of 
Lagrange multipliers and b is a (c — 1) x 1 vector of 
Lagrange multipliers. We differentiate to obtain 

dL 



c-l 



c-l 



^2 <t>'i£l<t>i - ^7i+10i<& 



i=l 



i=l 



5Y 

dL 



2SY - 2Y A - n 1/2 l n b'. 



Letting = leads to 

2SY - 2Y A - n 1/2 l n b' = 0, 
from which we have 

2i' n n 1 / 2 SY - 2i;n 1 / 2 Y A - i' n ni n b' = o. 

This implies b = 0. Accordingly, we obtain 
SY = Y A. 

We now take the eigendecomposition of A, letting 
A = Q'TiQ where Q is a (c — 1) x (c — 1) orthonor- 
mal matrix and T\ is a (c — 1) x (c — 1) diagonal 
matrix. We note that the diagonal entries of T\ and 
the columns of YoQ' are the eigenvalues and the 

1 /9 

associated eigenvectors of S. Clearly, II 1 l n is the 
eigenvector of S associated with eigenvalue 0. We 
now let F\ = diag(72, . . . , 7 C ). We thus have Yo = 
[/x 2 > ■ • - ; A*c]Q- Obviously, Yo satisfies Y Yo = I c _i 
and Y' Q U l / 2 l n = due to /Li / i n 1 / 2 l n = for i ^ 1. 

To verify that Yo is the solution of problem (2.4), 
we consider the Hessian matrix of L with respect to 
Y . Let vec(Y ) = (yu, ■ ■ ■ ,yi,c-i,y2i, ■ ■ ■ ,y n ,c-i)' ■ 
The Hessian matrix is then given by 

d 2 L 

H ( Y °)= * z-77=I c -i^S-A®I n . 

avec(Y ) avec(Y ) 

Let B be an arbitrary nonzero nx (c-l) matrix 
such that B'[/i 1 , . . . , fi c ] = 0. We can always express 
B = [n c+1 , . . . , MrJ* where * = [<j> u . . . , 4> c _ 1 ] is an 
(n — c) x (c— 1) matrix. Denoting T 2 = diag(7 c+ i, . . . , 
7 n ), we have 

vec((BQ)')'H(Y )vec((BQ) / ) 
= ^(Q'B'SBQ) - ti^AQ'B'BQ) 
= tr(B'SB) - tr(riB'B) = tr(#T 2 *) - tr(ri*'*) 



c-l 

= ^0' i (r 2 - 7m i n _ c )0 l >o. 

8=1 

If 7 C > 7c+i, then the matrices T 2 — 7j + il n _ c , i = 
1, . . . , c — 1, are positive definite. Thus, the above 
inequality is strict. This shows that Yo is a strict 
local minimum oftr(Y n- 1 / 2 Ln- 1 / 2 Y ) under the 
conditions Y Q Y = I c _i and Y II 1/2 l n = 0. 

A.4 The Solution of Problem (4.3) 

Let T = n 1 / 2 H' 7r KH 7r n 1 / 2 and consider the fol- 
lowing Lagrangian: 

L(Y ,A,b) 

= tr(Y TY ) - tr(A(Y Y - I c _i)) - b'Y n 1/2 l„, 

where A is a (c — 1) x (c — 1) symmetric matrix of 
Lagrange multipliers and b is a (c — 1) x 1 vector of 
Lagrange multipliers. Differentiating, we obtain 

— - = 2TY - 2Y A - U^ 2 l n b'. 
8Y 

Letting -J^ = leads to 

2TY - 2Y A - T^^lnh' = 0, 

from which we have 

2i' n n 1 / 2 TY - 2i;n 1 / 2 Y A - i' n ni n b' = 0. 



Since i;n x / 2 T = l^nH^KH.n 1 / 2 = l' n H,nK • 
H^II 1 / 2 = 0, we obtain b = 0. This implies 

TY = Y A. 

Now following the proof in Appendix A. 3, we find 
that the top c — 1 eigenvectors of T provide the so- 
lution for Yo in problem (4.3). 

A. 5 The Proof of Theorem 3 

Our proof is based on the following lemma. 

Lemma 1. Assume that A is annxn symmetric 
matrix with rk(A) = n — 1 and Al n = 0. Let A + be 
the MP inverse of A. Then n^H^A+H^II 1 / 2 is 
the MP inverse of HT 1/2 ATT 1 / 2 . 

PROOF. We first prove A+A = AA+ = H n . Let 
N = A'A. It is clear that NH n = H n N = N. It thus 
follows from Corollary 4.5.18 in Horn and Johnson 
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(1985) that there exists annxii orthonormal matrix 
U such that 



U'NU: 



1 







where A n _i is an (n — 1) x (n — 1) diagonal matrix 
with positive diagonal entries, and U = [Ui, -y^ln] 
with UjUi = I n _i and Uil n = 0. Here we use the 
fact that l n is the eigenvector of N and of H n with 
associated eigenvalue 0. Accordingly, we have 

N = U 1 A n _ 1 U / 1 and H. n = U 1 XJ' 1 , 

from which it follows that 

n+ = u 1 a^ 1 u; 

and hence N + N = UiU^ = H n . On the other hand, 
since A+ = (A'A)+A', we have A+A = N+N = 
H n . Since A is symmetric, we also have A A+ — H n . 

Using the identity A + A = AA + = H n and AH' n = 
A = A, we have 

n-^An-^n^^'.A+H.n 1 / 2 
= n- 1 / 2 H w n 1 / 2 = n 1 / 2 H;n- 1 /2 

= nVXA+H^n^n-^An- 1 / 2 . 

We further obtain 

n-i^An-^n^H'^A+H^n^n-vsAn- 1 ^ 
= n- 1 / 2 An- 1 /2 

and 

n 1 / 2 H;A+H,n 1 / 2 n- 1 / 2 An- 1 / 2 n 1 / 2 H;A+ 
H w n 1 / 2 = n 1 / 2 H;A+H w n 1 / 2 . 

Thus n 1 / 2 ^ A+H^n 1 / 2 is the MP inverse of IT 1 / 2 

• An- 1 / 2 . □ 

Since L + is the MP inverse of L, L + is positive 
semidefinite and it satisfies L + l n = and rk(L + ) = 
n - 1. It is obvious that rk(n- 1/2 LIl~ 1/2 ) = n - 1 
and rk(n 1 / 2 H / 7r L+H T n 1 / 2 ) = n - 1. Moreover, 
n 1/2 l n is eigenvector of both Ur 1/2 lAlr l/2 and 
Il 1 ' /2 H / 7r L + H 7r n 1//2 with associated eigenvalue 0. In 
addition, if A ^ is eig envalue of U-^LU' 1 / 2 with 
associated eigenvector u, then A -1 is eigenvalue of 
n 1 / 2 H^ r L + H 7r II 1 / 2 with associated eigenvector u. It 
thus follows from Lemma 1 that (4.3) has the same 
solution as (2.4) whenever L + = K. As a result, (4.2) 
has the same solution as (2.3). 
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